The massive 340 megabase genome of Anisogramma anomala, a biotrophic ascomycete that causes eastern filbert blight of hazelnut

Background The ascomycete fungus Anisogramma anomala causes Eastern Filbert Blight (EFB) on hazelnut (Corylus spp.) trees. It is a minor disease on its native host, the American hazelnut (C. americana), but is highly destructive on the commercially important European hazelnut (C. avellana). In North America, EFB has historically limited commercial production of hazelnut to west of the Rocky Mountains. A. anomala is an obligately biotrophic fungus that has not been grown in continuous culture, rendering its study challenging. There is a 15-month latency before symptoms appear on infected hazelnut trees, and only a sexual reproductive stage has been observed. Here we report the sequencing, annotation, and characterization of its genome. Results The genome of A. anomala was assembled into 108 scaffolds totaling 342,498,352 nt with a GC content of 34.46%. Scaffold N50 was 33.3 Mb and L50 was 5. Nineteen scaffolds with lengths over 1 Mb constituted 99% of the assembly. Telomere sequences were identified on both ends of two scaffolds and on one end of another 10 scaffolds. Flow cytometry estimated the genome size of A. anomala at 370 Mb. The genome exhibits two-speed evolution, with 93% of the assembly as AT-rich regions (32.9% GC) and the other 7% as GC-rich (57.1% GC). The AT-rich regions consist predominantly of repeats with low gene content, while 90% of predicted protein coding genes were identified in GC-rich regions. Copia-like retrotransposons accounted for more than half of the genome. Evidence of repeat-induced point mutation (RIP) was identified throughout the AT-rich regions, and two copies of the rid gene and one of dim-2, the key genes in the RIP mutation pathway, were identified in the genome. Consistent with its homothallic sexual reproduction cycle, both MAT1-1 and MAT1-2 idiomorphs were found. We identified a large suite of genes likely involved in pathogenicity, including 614 carbohydrate active enzymes, 762 secreted proteins and 165 effectors. Conclusions This study reveals the genomic structure, composition, and putative gene function of the important pathogen A. anomala. It provides insight into the molecular basis of the pathogen’s life cycle and a solid foundation for studying EFB. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-024-10198-1.


Background
The investigation of biotrophic fungi -pathogens that require living host tissue -is complex and challenging.Because of their dependency on the host organism, biotrophs are difficult to isolate and grow in artificial media.They often have strict nutritional requirements and may require certain hormones or signaling chemicals secreted by the host to induce spore germination [1,2].Satisfying these conditions complicates any form of manipulation under laboratory conditions.Studies of rust fungi, which are basidiomycetes, and powdery mildew fungi, which are ascomycetes, highlight many of these challenges.Despite the significant economic impact of the resulting diseases, complete life cycles of these fungi have never been witnessed outside of their natural hosts.Consequently, despite substantial effort on the parts of many scientists, many details of host-pathogen interactions in rust and powdery mildew fungi remain poorly understood [3][4][5].
Advances in sequencing and bioinformatic tools have led to the rapid development of genomic techniques that facilitate investigation even of recalcitrant organisms.As the number of sequenced fungal genomes expands, patterns and features that are linked to obligate biotrophy have emerged [6,7].Genomic features, including both coding and non-coding elements, reveal characteristics of lifestyle and pathogen biology [8,9].A large repertoire of species-specific secreted small cysteine-rich proteins that represent candidate effectors is typical of biotrophs that have gene specific interactions with their host [10,11].Large genomes inflated with repetitive elements are another hallmark of biotrophic pathogens, as amplification of such elements contributes to a flexible genomic landscape that is highly adaptable to the gene-for-gene arms race that pathogens engage in with their hosts [11][12][13].Identifying these characteristics of genomic features can fill in the blanks left by a lack of experimental data [14,15].
One such fungal pathogen whose biology lacks understanding is Anisogramma anomala, an ascomycete within the order Diaporthales.A. anomala causes Eastern Filbert Blight (EFB), a devastating disease of hazelnut (Corylus spp.).The native host of A. anomala, American hazelnut (C.americana) tolerates infection, displaying mild disease symptoms and small, non-threatening cankers [16][17][18].Both host and fungus are abundant on the east coast of the U.S.However, nearly all cultivars of the commercially important European hazelnut (C.avellana) are highly susceptible and develop severe perennial cankers that girdle stems, resulting in branch die-back and eventual tree death [19][20][21].As such, EFB is the primary limiting factor of commercial hazelnut production in North America [22].Historically, C. avellana cultivation was restricted to the Pacific Northwest region outside of the native range of A. anomala, limiting hazelnut cultivation to a fraction of its potential growth range [23].Today, after an inadvertent introduction in the 1960s [24], EFB is widespread in the Pacific Northwest where it significantly impacts commercial production.Disease management costs were alleviated only recently by the release of resistant cultivars [25].Despite the economic importance of A. anomala and considerable efforts now underway to breed for resistance [26], the EFB pathosystem remains poorly understood.
To support disease management and resistance breeding efforts, there is a need for a better understanding of the biology of A. anomala and the EFB pathosystem.However, A. anomala is an obligate biotroph, presenting many of the methodological difficulties as do rust fungi, powdery mildews, and other biotrophic pathogens [27].The only useful source of tissue of A. anomala is ascospores extracted from the stromata of cankers of infected hazelnut, and successful subculture has not been achieved.Ascospores represent the only known spore stage of A. anomala; no conidial stage has been documented.Ascospores, by nature, are sexual spores and are not isogenic.While A. anomala ascospores can germinate and form small, branching germ hyphae, the fungus cannot be grown continuously in culture.It is predicted that A. anomala exhibits some form of self-inhibition, as ascospores will germinate in axenic culture only with the addition of an adsorbent such as activated charcoal or bovine serum albumin (BSA) [27].Even with these additives, germinated ascospores exhibit poor growth and form small colonies (~ 0.25-0.5 mm in diameter) that yield little biomass [28].Furthermore, the disease exhibits a complex, two-year infection cycle, which normally includes 15-18 months of latency, in which it is not feasible to visibly identity infected trees (Figure S1) [20,21,[29][30][31].
Despite the challenges to performing experimental host/pathogen research, we saw the importance in understanding more about A. anomala, both as contributions to the U.S. hazelnut industry, and to plant pathogen biology.Due to the lack of an experimental system by which to study A. anomala, we used a genomic approach to elucidate features of EFB biology and pathogenesis.An earlier draft genome of A. anomala was assembled and mined for sequences that would be useful as simple sequence repeat (SSR) primers to examine population biology of the fungus and assist with resistance breeding [28].That study revealed that the genome of A. anomala is surprisingly large, > 300 megabases (Mb) and consists of an abundance of transposons that constitute nearly 90% of the genome sequence.In this study, we present an updated and refined draft of the A. anomala genome sequence, its annotation, and analysis.Genomic analysis reveals characteristics of biotrophy, including a massive population of transposable elements (TEs), bimodal distribution of GC content, and a cache of genes encoding effector molecules.We also identified a number of genes that code for proteins predicted to be involved in pathogenesis and host/pathogen interactions.The annotated genome of A. anomala will serve as a vital resource for future research on the pathogen and EFB disease.

A. anomala has a large, gene-poor genome
The mate-pair and paired-end reads of genomic DNA for Anisogramma anomala OR1 generated over 31 Gb of data that were assembled into a 342,525,599 nucleotide (nt) genome with an average 91 × coverage (Table S1).The final assembly was distributed across 112 scaffolds with a GC content of 34.46%.Four scaffolds with a combined length of 27,247 nt were removed from further analysis as contamination, resulting in a final assembly size of 342,498,352 nt across 108 scaffolds.More than half of the assembly (N50) was on 5 scaffolds with an N50 scaffold length of 33.3 Mb.The largest scaffold was 43.9 Mb (Table 1).Nineteen major scaffolds (> 1 Mb) represent over 99% of the genome.This demonstrates a marked improvement over the first version of the assembly, published in 2013 [28] (Table S2).We identified telomere sequences (repeats of TTA GGG ) on both ends of the second and third largest scaffolds with lengths of 40.1 and 39.2 Mb respectively, indicating these two scaffolds represent full-length chromosomes.Telomere sequences were also found on one end of 10 other scaffolds.Of the 19 largest scaffolds, telomere sequences were found on one end or both ends in 10 scaffolds (Fig. 1).On the contig level, the N50 was 196,655 bp and the L50 was 528 (Table 1).
To evaluate the completeness of the A. anomala genome assembly, we performed flow cytometry using nuclei released from 8-week old mycelium.Based on flow cytometry, the genome size of A. anomala OR1 was estimated to be 370 Mb (Figure S2), slightly more than, but consistent with the genome assembly estimate.
Using a combination of RNA-seq evidence and ab initio gene prediction, we predicted 9,179 protein coding genes in the A. anomala genome.This gene set includes 94.4% of eukaryotic benchmarking universal single-copy orthologs (BUSCOs) and 95.5% of fungal BUSCOs.Average gene density on major scaffolds was approximately 25.8 genes/Mb and remained relatively consistent among major scaffolds (Fig. 1).
Gene models were annotated with Gene Ontology (GO) terms merged with InterPro IDs.Eighty-eight percent of gene models had BLASTp hits against the NCBI nr database.Approximately 75% of gene models have  been annotated by biological process and 50% with a molecular function (Fig. 2, Table S3).Gene models were also annotated with KEGG Orthology (KO) terms, using a combination of the KEGG Automated Annotation Server and BlastKOALA.Roughly 38% of protein sequences were assigned KO identifiers, which make up 99 complete or nearly complete KEGG pathways (Table S3).

A. anomala has large arsenal of effectors and CAZymes
To identify proteins that may be involved in virulence and disease, we identified genes that code for potential effectors, molecules that are involved in host/pathogen interactions.We first identified 762 proteins with signal peptides as evidence of a secreted protein.Those proteins were then analyzed with EffectorP 2.0 to further predict potential effector proteins.One hundred and sixty-five proteins (1.8% of total proteins.21.7% of secreted proteins) were predicted to be effector candidates (Table 1).All effector candidates were subjected to a BLASTp search of the NCBI nr database.Over half (55%) of candidate effectors returned no BLAST hit, and of those that did return a hit, 42% were hypothetical proteins or proteins with unknown function.For those effector Fig. 1 Repeat content and gene density distribution across major scaffolds (> 1Mb).Size of bar reflects the length of the scaffold (x-axis).Repeat density of dispersed repeats in bins of 100kb is represented as a heat map, ranging from white to black, the darker the color indicating higher repeat density.The height of each scaffold bar (y-axis) ranges from 0 to 200 genes/Mb with the average gene density of the scaffold plotted in orange.Gene density was calculated per 100kb and plotted along each scaffold in purple candidates that match a protein with a known function, possible roles include one glycoside hydrolase, one cutinase, and two peptidases (Table S4).Genes encoding putative effector molecules were evaluated for their proximity to the closest repeat element and the closest large RIP affected region (LRAR) as predicted by RIPPER.BUSCOs and a random subset of all genes were included for comparison (Fig. 3).On average, effectors were approximately 1.5 kb from the nearest TE while BUSCOs and a randomized set were 3 kb and 2.5 kb respectively.The closest distance to LRARs for effectors, BUSCOs, and the randomized set did not differ significantly from each other and averaged at 9900 bp, 9400 bp, and 9700 bp respectively.
Genome and annotation statistics including genome size, repeat content, and different categories of protein coding genes (effectors, CAZymes, and biosynethetic gene clusters) were compared to related fungi (Table 2).Like other biotrophic fungi, A. anomala has a large genome with high repeat content (shown below).A large number of effectors (relative to total protein coding genes), small number of biosynthetic gene clusters and CAZymes are other hallmarks shared between A. anomala and related biotrophic fungi.Bold font indicates missing information from the respective study, numbers were predicted using the same methods used in the current study a Indicates missing CDS data, Biosynthetic gene clusters were predicted using genomes only

A. anomala genome hosts a large population of transposable elements (TEs)
The A. anomala genome hosts a large population of TEs that accounts for approximately 88% of the final genome assembly (Table 3).Repeat content remained relatively constant at 88% across major scaffolds (Fig. 1).The TE population consists of 2,536 individual repeat families, making up over 300,000 individual interspersed elements (Table 3).The vast majority (90%) of repetitive sequences was comprised of Long Terminal Repeat (LTR) retrotransposons, mostly Copia-like elements, which alone account for over half of the genome assembly.Eight of the ten repeat families with the highest copy numbers (> 7,000 members each) were identified as Copia-like elements.

A. anomala exhibits "two-speed" genome
The overall distribution of GC-content across major scaffolds remained relatively constant at approximately 34%.However, measurement of proportions of GC-distribution across the entire genome reveals two peaks, indicating a bimodal genome (Fig. 4).The first peak, at 32.9% GC indicates AT-rich regions.This peak accounts for 93% of the genome and 10% (933/9,179) of the protein coding genes.These AT-rich regions are gene poor, with an average gene density of 2.93 genes/Mb.The second peak, at 57.1% GC indicates GC equilibrated regions that account for 7% of the genome and 90% (8,246/9,179) of protein coding genes.These GC-equilibrated regions are over 100-fold more gene-dense with an average of 344 genes/Mb.We performed an enrichment analysis using the Fisher's exact test of the gene models within AT-rich genomic regions (Table S6), sheet 1).A number of GO terms are over-represented (p-value < 0.05) including beta-glucan/cellulase metabolism, peptidase/hydrolase activity, and ion transport (Table 4).Additionally, despite these regions encoding only 10% of protein coding genes, 30% (49/165) of predicted effector coding genes were found in these AT-rich hotspots.

A. anomala exhibits a number of unique gene families
We performed an Orthofinder analysis to identify gene families shared with related fungal pathogens (Table S7).A super-gene phylogeny was constructed using 34 singlecopy orthologous gene families and their corresponding protein sequences.Gene family counts were used to reconstruct ancestral gene family content and gain/loss of homologous gene families with Wagner parsimony and stochastic mapping (Fig. 5).
There are 1,121 gene models that are not identified as orthologous to related fungal pathogens and are likely specific to A. anomala (Table S6), sheet 2).Of these unique genes, 83 of them are predicted to code for effectors, indicating that over half of the predicted effectors are unique to A. anomala.GO terms overrepresented include beta-glucan and cellulose metabolism (p-value < 0.05), suggesting a role in production of plant degrading compounds (Table S8).An additional 450 GO terms are underrepresented, mostly including processes involved in central metabolism and fungal growth and development.
The Orthofinder analysis and Wagner parsimony revealed 354 genes families gained and 721 lost in A. anomala since diverging from its last common ancestor with C. parasitica.Gene families that are expanded or gained in A. anomala account for an additional 32 putative effector genes-meaning that approximately 70% of putative effectors are in species specific gene families or lineages of gene families that have expanded in A. anomala.GO terms overrepresented in gained/expanded gene families include catabolic processes and degradation of organic compounds (Table S9).The GO terms that are underrepresented include protein, organelle, and cellular biosynthetic processes.

Transposable elements show evidence of Repeat-induced point mutation (RIP)
The A. anomala genome encodes two genes that exhibit sequence homology and are orthologous to rid (RIP defective) in Neurospora crassa.The two genes are predicted to encode a C5-DNA methyltransferase and a modification methylase respectively.Both genes have been assigned GO terms for methyltransferase activity.A. anomala also encodes a homolog of dim-2, an additional methyltransferase identified in N. crassa to be involved in the RIP process (Figure S3).
Dinucleotide frequencies and RIP indices were calculated for a subset of up to 100 members for all identified repeat families (Fig. 6a).Compared to a control of nonrepeat sequences, repeat sequences exhibit an over-abundance of TpA (6.7 × more frequent) and TpT (4.1 × more frequent) dinucleotides and under-abundance of GpC (3.9 × less frequent) and CpG (2.8 × less frequent) dinucleotides.RIP indices were also calculated for the same subsets of repeat families (Table 5).The mean TpA/ApT index for repetitive sequences is 1.14, while non-repeat sequences have an index of 0.48.The mean (CpA + TpG)/ (ApC + CpT) index is 0.087 in repetitive sequences and 0.095 in non-repetitive sequences.There were no significant differences in dinucleotide frequencies or RIP indices between repeat classes.
An alignment-based RIP analysis of the repeat family with the highest copy number shows that A. anomala exhibits two dominant kinds of RIP (Fig. 6b).CpA→ TpA and CpT→ TpT mutations were dominant over other RIP-like mutations.The top 10 repeat families with the highest copy number were also analyzed with the alignment-based RIP analysis and demonstrate the same RIP mutational preference.

A. anomala demonstrates genetic basis for homothallism
Homologs for both MAT1-1 and MAT1-2 idiomorphs have been identified in the A. anomala genome within the same 7 kilobase cluster (Table S6, sheet 3), consistent with evidence that the fungus is homothallic [20].Homologs for the mat genes were identified through a BLASTp search of the NCBI nr database and verified by a pairwise sequence comparison to the corresponding genes in Cryphonectria parasitica [56].Like the C. parasitica idiomorphs, three protein-coding genes are predicted to constitute MAT1-1 (MAT1-1-1, containing an alpha box motif; MAT1-1-2, a protein of unknown origin; and MAT1-1-3, containing an HMG motif ), and a single protein-coding gene is predicted for MAT1-2 (MAT1-2-1, also containing an HMG motif ).Within the A. anomala MAT locus, the gene encoding MAT1-2-1 was embedded between MAT1-1-1 and MAT1-1-2.Other genes usually associated with mating clusters in fungi, apn2 and sla2, were identified in close proximity to the other MAT protein coding genes.The entire MAT cluster is largely syntenic to that of Chrysoporthe cubensis, a closely related homothallic fungus.The MAT loci of A. anomala is more compact and contains no additional genes besides those directly involved in determining mating type (Fig. 7).RNAseq data indicate that all four of these MAT genes were expressed constitutively (Figure S4).

Discussion
The final genome assembly of A. anomala OR1 is approximately 343 Mb.This assembly is thought to be relatively complete based on genome size estimation compared to flow cytometry data as well as identified BUSCOs.The A. anomala genome is very large by fungal standards,    [11], and rust fungi, which are basidiomycetes often with complex life cycles, may have genomes approaching 1 Gb [59].Both of these unrelated fungi are subjected to the strong selective pressure imposed on biotrophic plant pathogens to maintain an intimate interaction with their host while avoiding recognition that initiates an immune response [60,61].The outcome of evolution driven by the pressure of a host/pathogen arms race is parallel adaptations resulting in remarkedly similar genomes among biotrophic pathogens [7,11,[32][33][34][35][36][37][38][39].The expansion of the A. anomala genome is driven by the proliferation of TEs, rather than accumulation of protein coding genes.The TE population is made primarily of LTR retrotransposons.Copia-like elements are by far the most abundant, which contrasts related fungi that are dominated by Gypsy-like repeats [48].Despite the massive number of identified LTR retrotransposons, no single element has been determined to be intact with both 5' and 3' LTRs and the protein domains required for autonomous transposition, namely reverse transcriptase (RT), RNAse H (RH), and integrase (INT) [62].One of the looming questions regarding the TEs in the A. anomala genome is how the invasion and uncontrolled replication of repetitive elements, largely of a single type of TE, was responsible for such extreme genome expansion.
Effector molecules play an important role in the colonization of biotrophic plant pathogens.Plants are able to recognize specific effector molecules through resistance (R) genes and activate a powerful hypersensitive response (HR) resulting in plant cell death which halts the spread of the invading pathogen [63].If recognized, the pathogen is considered avirulent and the effector protein that triggers the HR response is characterized as an avirulence (avr) gene [64].The pathogen responds by mutating or losing avr genes, so that they are no longer recognizable, or developing new effectors that avoid or suppress the effector-triggered immune response [65].This relationship is the basis of the coevolutionary arms race between host plants and pathogens [66].Most commercially available cultivars of C. avellana are protected by the R-gene "Gasaway", named after the pollinizing cultivar that carried the dominant allele [67].There is evidence that the Gasaway R-gene protects through an HR response [21].However, Gasaway protected plants are overcome in regions of high pathogen pressure and diversity, suggesting that effectors and avr genes play a role in the breakdown of resistant cultivars [68][69][70].
For many putative effectors, there is no known function.As the goal is to be unrecognizable, there is no benefit to maintain conserved effector genes.However, we know that effectors can play multiple roles in establishing and maintaining infection [71,72].The large arsenal of putative effectors encoded in the A. anomala genome allows for flexibility.This is also why we have observed effectors in repeat-rich regions of genome, where there are high rates of mutation and recombination [73].The compartmentalization of effectors and genes involved in pathogenicity in repeat rich regions fits the "two-speed" model of evolution [74].
CAZymes play an important role in both necrotrophic and biotrophic phytopathogenic infection.Necrotrophic fungi are known for having an arsenal of plant cell wall busting enzymes to launch an aggressive attack on their host.Necrotrophic ascomycetes code for between 600-800 CAZymes while A. anomala codes for 456 putative CAZymes, a typical number for biotrophic pathogens [75].One of the most notable families of CAZymes encoded in the A. anomala genome is the glycoside hydrolase-18 (GH-18) family that includes all identified fungal chitinases [76].It is predicted that both plant and fungal cell wall degrading enzymes are important for establishing biotrophic infection.Histological data of early infection of A. anomala on C. avellana shows a single germ hypha penetrating the plant cell wall, followed by the formation of intracellular vesicles [21].CAZymes are required for initial penetration of the plant cell wall as well as the reformation of the fungal cell wall at the fungal/host interface [77].Reasonable future steps would include investigating the expression of CAZymes during early infection to elucidate what genes are required to establish infection.
The Wagner parsimony analysis on Orthogroups of related fungal pathogens revealed the loss of 876 and gain of 285 gene families in A. anomala since it diverged from C. parasitica.Gene reduction in obligate parasites is a common trend, usually due to the loss of specific metabolic pathways as the parasite derives required compounds from their host [78].KEGG pathway reconstruction revealed a number of missing or incomplete pathways for the biosynthesis of several amino acids including lysine, tryptophan, and asparagine [79].It should be noted that the culture medium used for A. anomala contains yeast extract as well as additional asparagine to encourage growth [27].Genes involved in pathways involved in energy generation (NADH dehydrogenase, nitrate reduction/assimilation) are missing as well.A. anomala exhibits parallel evolution to unrelated obligate biotrophic fungal pathogens that have independently lost similar biosynthetic and metabolic pathways [60,80,81].
The exception to the trend of gene loss is genes or gene families that encode effectors.Gene families that are unique or expanded in A. anomala contain 70% of predicted effectors.Biotrophs use effectors to maintain an intimate signaling relationship during infection [82,83].The need for a large and diverse effector arsenal drives the evolution of effector diversification and expansion [84,85] as we observed with A. anomala.GO terms overrepresented in unique or expanding families include transmembrane transporters that are involved in the secretion of secondary metabolites that participate in pathogenesis.Amylases, peptidases, and catabolic activity GO terms are overrepresented in expanded families, likely aiding in adaptation to the obligate biotrophic lifestyle [10].
Despite TEs accounting for 88% of the final genome assembly, very few of these elements contain intact protein domains required for autonomous transposition.Sequences from all identified repeat families show evidence of RIP mutation.RIP is a defense mechanism that protects fungal genomes from TEs expanding unchecked [86,87].RIP functions by recognizing stretches over 400 bp of DNA with high (> 80%) sequence identity.The DMNT-1 homologue RID (RIP defective) methylates cytosine residues, which then undergo spontaneous deamination into thymine.This induces C→ T and G→ A transitions in both copies of duplicated sequences, resulting in permanent mutational changes in the DNA sequence [88,89].
Fungi that demonstrate evidence of RIP vary in the degree or effectiveness by which RIP acts on the genome.Neurospora crassa, in which RIP was first described  7 Genomic region corresponding to mating-type locus in A. anomala.Gene models were identified as MAT homologs through BLASTp search of NCBI database and analyzed for synteny compared to Chrysoporthe cubensis, a homothallic fungus in the Cryphonectriaceae [90], has a very efficient RIP system, to the point where N. crassa has almost no duplicated sequences, TEs nor duplicated genes [91].In other fungi, RIP is often demonstrable, but less effective [92].In the related fungus C. parasitica, roughly 14% of the 43.9 Mb genome represented TEs, and there was some limited evidence for RIP [48,92].The A. anomala genome exhibits indications of RIP activity.The RIP indices calculated for repeat families exceed the accepted threshold for RIP activity (TpA/ ApT ≥ 0.89 and (CpT + ApT)/(ApC + CpT) ≤ 1.03) [93,94] and the dinucleotide frequencies demonstrate a depletion of pre-RIP dinucleotides and an enrichment of post-RIP dinucleotides in repeat regions compared to non-repeat regions [92].In spite of evidence of a functional RIP pathway, transposons have managed to overtake the A. anomala genome.The massive expansion of the TE population in the A. anomala genome underscores the observation that mere presence of an apparently functional RIP system is no guarantee that TEs will be held in check.
In addition to defending against the uncontrolled replication of TEs in a genome, RIP is a major driver of genome evolution.RIP induces mutations on duplicated sequences, but those mutations often bleed into neighboring regions, so called "leaky RIP" [86,95,96].Furthermore, the G/C→ T/A mutations have a major impact on GC content of a genome.The GC content of the A. anomala genome is relatively low, at 34%, however, it is not equally distributed across the genome.GC-proportion distribution reveals two peaks; 93% of the genome landscape has a GC-content of 32%.These stretches of GC-poor containing DNA are broken up by GC-rich blocks that are gene-rich and TE-poor.These data demonstrate that A. anomala fits the "two-speed" genome model [96][97][98].
Analysis of the mating-type locus revealed that A. anomala has the genes for both MAT1-1 and MAT1-2 idiomorphs, providing molecular evidence to support the previous evidence for homothallism [20].Mating type systems, processes, and their associated genes are extraordinarily complicated in fungi, and many genes other than the MAT genes themselves may have different roles in the reproductive process [99].In addition to controlling sexual development, MAT genes may be important in growth and virulence, including regulation of secondary metabolites and hyphal morphology.
Homothallism, such as that in A. anomala, is thought to be an evolutionary destination from which there is no likely return to a progenitor heterothallic state, an idea that was supported through research with Neurospora shifting multiple times from heterothallic to homothallic lifestyle, but never the reverse [100].Chrysoporthe which is closely related to Anisogramma, has MAT locus features similar to Neurospora, including pronounced influence of retrotransposons, but there was some evidence to suggest that the MAT1-2 and MAT1-1 idiomorphs of the heterothallic C. austroafricana evolved from a homothallic progenitor [101].In both the case of Neurospora and Chyrosporthe, the evolutionary transition of mating type is facilitated by TEs within the mat locus.Like the rest of the A. anomala genome, the mat locus is flanked by TEs.But the core genes for each idiomorph are found within the same 7 kb block with no TEs or additional genes.The mating cluster of C. cubensis includes additional genes not found in A. anomala as well as a 200 kb insertion of DNA that contains over 60 genes not related to determination of mating type (Fig. 7).One of the roles of sex in fungi and other organisms is to bring genetic variation to the species.It seems that a combination of homothallic sex and rampant genome invasion and expansion by transposons brings sufficient variability to A. anomala.

Conclusions
At nearly 350 Mb, the A. anomala genome represents the largest ascomycete genome yet characterized.Gene number and putative functions are typical of fungal plant pathogens, but runaway amplification of repeat sequences has led to a massively bloated genome, despite hallmarks of functional genome surveillance by RIP.The A. anomala genome characterization will serve as a resource for others investigating this economically important plant pathogen, and for those interested in fungal genome evolution.

Fungal strain
A. anomala is an obligate biotroph that has not been grown in continuous culture, so tissue is scarce and not clonal.Based on knowledge of EFB epidemiology [102,103], the A. anomala population in Oregon is believed to be decedents of a single introduction event from east of the Rocky Mountains and belong to a single lineage.But mycelium from different trees in fields is not clonal, and DNA or RNA extracted from a collection of germinated ascospores is also not clonal.The closest approximation we have to homogeneous tissue is to harvest ascospores from a single canker on a single tree, with the understanding that it most likely represents the result of a single infection.We collected ascospores from individual cankers from infected branches harvested from hazelnut plants growing at the Oregon State University Smith Horticultural Research Farm, Corvallis, OR.These plants had been inoculated 18 months prior in the greenhouse using local diseased plant material as inoculum source.We designate the strain presented here Oregon1, OR1.
Ascospores were extracted following the protocol we used previously [28].Briefly, the branches were cut into pieces 5-7 cm in length, and surface-sterilized for 3 min in 10% bleach (0.525% sodium hypochlorite) followed by 1 min in 70% ethanol.After rinsing with sterile H 2 O, the stromata were hydrated in sterile H 2 O for 30 min and airdried.The top of a canker was cut off with a sterile razor blade to expose the necks of the perithecia, and another sterile razor blade was inserted under the perithecia to provide pressure from below and push ascospores out of perithecial neck.The spores from individual cankers were suspend in sterile H 2 O containing 10 ppm rifampicin and 100 ppm streptomycin and quantified with a hemocytometer.We found one canker that produced approximately 5.5 M ascospores and these spores were used in this study unless noted otherwise.
To generate primary mycelium, a portion of the ascospores was adjusted to 1 × 10 5 spores per ml and used to inoculate plates of culture medium overlaid with cellophane.The rest of the ascospores were stored at -80 °C.Half a milliliter of the spore suspension was spread on the cellophane surface in individual 9-cm diameter petri dishes.The medium contained (per liter) 2.7 g modified Murashige and Skoog basal salt mixture; 20 g sucrose; 2 g yeast extract; 2 g L-Asparagine; 15 g Bacto agar; 0.25 g activated charcoal; and 10 mg Rifampicin [27].The cultures were grown at 18 °C in the dark for 8 weeks, by which time many spores had germinated and grown into opaque, whitish colonies approximately 0.25-0.5 mm in diameter.Mycelium was harvested by rinsing the cellophane with sterile H 2 O.A subset of plates was kept for four more weeks.By then the small colonies were turning grey and black, and the senescent mycelium was harvested as described above.

Nucleic acid extraction, genome sequencing and assembly
Mycelium from 8-week-old cultures were used for DNA extraction using Gentra Puregene kit (Qiagen) following the fungi protocol.One paired-end DNA library with insert size approximately 350 bp (excluding adapters) was constructed using the TruSeq DNA Sample Prep kit (Illumina).Three mate-pair DNA libraries with insert sizes approximately 3 kb, 6 kb, and 10 kb, respectively, were constructed using the Nextera Mate Pair Library Prep kit (Illumina) following manufacturer's instructions.All libraries were sequenced on the Illumina MiSeq platform.
The paired-end reads were trimmed with Trimmomatic v0.32 [104] in paired-end mode to remove adapter sequences and reads shorter than 100 bp after trimming were dropped.The mate-pair reads were first trimmed with Trimmomatic in paired-end mode to remove external adapters, then trimmed with Trimmomatic in single-end mode to remove internal adapters at ligation junctions.Reads shorter than 35 bp after trimming were dropped.The resulting reads were processed with a custom Perl script and only read pairs meeting the following conditions were retained for genome assembly: 1) both reads must have survived adapter trimming; 2) for read pairs in which external adapters were found, the junction adapter must be found in both reads; 3) for read pairs in which external adapters were not found, junction adapter must be found in at least one read.After data processing, the sequence reads were assembled using AllPaths-LG release 52,155 with default settings.[105].Assembled scaffolds were subjected to a BLASTn search of the Gen-Bank database release 258 [106].Any scaffolds where the top hit was not fungal were removed as contamination.

Flow cytometry
One hundred micrograms of freshly harvested 8-week old mycelium were cut into fine pieces with a sterile razor blade in 500 μl LB01 buffer on ice to release the nuclei [107].The mixture was passed through a 40 μm filter and washed with 200 μl LB01 buffer.Nuclei from 50 mg young radish leaf, which has a 2C genome size of 1.1 Gb, were released the same way and used as control.Nuclei solutions were treated with RNase A and stained with propidium Iodide at room temperature for 20 min in darkness and run through a Beckman Cytoflex flow cytometer.The experiment was repeated three times.

Repeat identification and masking
The assembled genome was soft-masked prior to gene prediction [108].A comprehensive, non-redundant repeat library was created by integrating output from RepeatModeler [109,110], TransposonPSI [111], and LTRharvest [112].RepeatModeler v1.0.11 and Trans-posonPSI were run using default parameters to generate the first two repeat libraries.The third repeat library was built using LTRharvest.False positives were removed from the LTRharvest library by running LTRdigest with protein HMMs from Pfam [113] and GyDB [114] databases.LTR retrotransposons without domain hits were removed from the LTRharvest repeat library.
Each of the three repeat libraries was classified using RepeatClassifier, part of the RepeatModeler program suite, with Repbase version 23.08 [115], for consistency in identification and naming of repeat elements.The three repeat libraries were then merged and clustered with CD-HIT [116] at ≥ 80% identity to create a non-redundant library [117].This custom library was used to softmask the A. anomala genome using RepeatMasker with the "xsmall" argument and default parameters [118].

Transcriptome sequencing, gene prediction and annotation
Ascospores, 8-week old mycelium and 12-week old senescent mycelium were used for RNA extraction using the Plant RNeasy kit (Qiagen) following manufacturer's instructions.Three mRNA libraries, one for each sample, were prepared using the TruSeq RNA Preparation kit (Illumina) following manufacturer's instructions.The libraries were sequenced on the Illumina MiSeq platform.
Gene models were predicted using the BRAKER2 annotation pipeline [119], incorporating GeneMark-ET [120,121] and Augustus [122] for ab initio and evidencebased gene prediction.RNA-Seq reads were mapped to the genome assembly using STAR [123].The RNA-Seq mapping results were used as evidence for gene prediction in the BRAKER2 pipeline, using the "fungus" argument for fungal gene prediction.Genome completeness was assessed through a BUSCO analysis of benchmarking eukaryotic and fungal single-copy orthologs [124].
Blast2GO v5.2.5 [125] was used to perform a BLASTp search of the NCBI nr database with E-value cutoff of 1e-3.Interproscan v5.53 [126] results were imported in to Blast2GO and merged with GO annotations.KEGG annotation terms [79,127] were assigned using a combination of BlastKOALA v2.2 [128] and the KEGG Automatic Annotation Server (KAAS) [129] searched against eukaryote and prokaryote KEGG GENES databases (release v89.1), with the single-directional best hit method.
Secreted proteins were predicted using SignalP 5.0 [130] to identify signal peptides sequences.Predicted secreted proteins were then analyzed with EffectorP 2.0 [131,132] to predict genes encoding for potential effectors.Evidence including protein size and cysteine content was used for effector prediction.Potential function of effectors was evaluated by a BLASTp search [133] of the GenBank nr database (release 239) [106] with an e-value cutoff of 0.001.Functional domains were assigned using CD-Search webserver with default settings against the Conserved Domain Database v3.20 [134,135].Carbohydrate active enzymes were predicted using the dbCAN3 meta server [136][137][138] which integrates HMMER [139], DIAMOND [140], and Hotpep [141] searches of the CAZy database [142].Biosynthetic gene clusters were predicted and identified using antiSMASH v5.1.2[143].

GC-content distribution
Analysis of GC-content was performed by segmenting genomic sequences into regions of differing GC-content using the Jensen-Shannon divergence at each sequence position calculated using OcculterCut v1.1 [144].Gene models associated with AT-rich genomic regions were used as a test set in a GO term enrichment analysis test using a two-tailed Fisher's Exact Test with a filter value of 0.05 with BLAST2GO v5.2.5 [125].

Fungal super-gene phylogeny
We collected proteomes from 24 related ascomycete species to identify orthologous gene families.OrthoFinder v2.2.6 [145] was used under default settings to build orthogroups.Thirty-four single-copy orthologous gene families and their corresponding protein sequences were retrieved and aligned with MUSCLE v3.8.31 [146] and alignments were trimmed with TrimAl v1.4 [147] using the automated feature to select the best method.The trimmed alignments were automatically concatenated and partitioned using IQ-TREE v1.7-beta17 [148,149].The maximum likelihood tree was reconstructed with IQ-TREE under the LG + I + G model as selected using ModelFinder [150].
Gene family counts from the Orthofinder analysis were used to reconstruct ancestral gene family content and gain/loss of homologous gene families.These traits were reconstructed using Wagner parsimony in the Count software package [151] as well as stochastic mapping with GLOOME [152].

RIP analysis
RIP indices of individual repeat copies were calculated in RStudio v1.1.414[153] using a custom R (v4.1.2) script (file S1) and the Biostrings package v2.62.0 [154].RIP-CAL v2 [155] was used for alignment based analysis of repeat families and calculations of mutation frequencies.Large RIP affected regions (LRARs) were identified by a minimum of seven consecutive sliding windows (window size = 1000 bp, slide size = 500 bp) with a minimum RIP product value of 1.1, maximum RIP substrate value of 0.75 and minimum composite (product -substrate) value of 0.01.RIP product, substrate, and composite values and LRAR analysis was performed using The RIPper [156].

Fig. 2 Fig. 3 A
Fig. 2 Annotation of A. anomala predicted gene models by Gene Ontology (GO) categories.Functions of genes are shown by biological process and molecular functions

Fig. 4
Fig.4 Distribution of GC-content across the A. anomala genome.The genome was broken up into regions using Jensen-Shannon divergence, for which GC-content was calculated.Proportions of the genome were assigned to GC-content in 1 percent increments

Fig. 5
Fig. 5 Predicted pattern of gene family gain and loss in representative fungal genomes.Cladogram representation of Maximum Likelihood phylogeny of A. anomala and 15 related fungi based on 2,800 single copy orthologues.The total number of protein families in each species or node is estimated by Wagner parsimony and stochastic mapping.The numbers of the branches correspond to gene family gain (green) or loss (red) and inferred ancestral protein families (in oval).The numbers of gene families, unassigned genes, and total gene numbers are indicated for each species

Fig. 6 a
Fig. 6 a Log(10) of average fold change in dinucleotide frequencies of all repeat families compared to non-repetitive control sequences.b Alignment based RIP analysis of repeat family with highest copy number (rnd-1_family-0).Each type of RIP mutation is represented by a different color, demonstrating the most dominant types of RIP within this repeat family

Table 1
Table of assembly statistics and feature summary for the haploid genome of A. anomala

Table 2
Genomic statistics of related fungi used for

Table 3
Detailed breakdown of the repeat population in the A. anomala genome.Repeat elements are classified by the name assigned by RepeatClassifier

Table 4
GO term enrichment analysis of genes within the AT-rich regions of the A. anomala genome

Table 5
Calculated RIP-indices for the five repeat families with the largest copy number.A TpA/ApT index ≥ 0.89 and (CpT + TpG)/ (ApC + GpT) index ≤ 1.03 indicate RIP activity.The numbers presented are the mean values calculated for a subset of 100 repeat family members